Abstract. A typical approach to developing an algorithm for analyzing gravitational wave
data is to assume a particular waveform and use its characteristics to formulate a detection criterion.
Once a detection has been made, the algorithm uses those same characteristics to tease out parameter 
estimates from a given data set. While an obvious starting point, such an approach is initiated by 
assuming a single, correct model for the waveform regardless of the signal strength, observation 
length, noise, etc. This paper introduces the method of Bayesian model selection as a way to select 
the most plausible waveform model from a set of models given the data and prior information. The 
discussion is set in the scientific context of the proposed Laser Interferometer Space Antenna.

INTRODUCTION 

The anticipated data from the proposed Laser Interferometer Space Antenna (LISA) introduces
a number of exciting and original challenges. Central among these challenges is the
development of data analysis routines capable of coaxing out and characterizing individual
signals from the noisy time series LISA will return. A great deal of work has already
been invested into the development of algorithms applicable to the LISA data. While a 
number of these algorithms have demonstrated favorable capabilities on simulated data, 
each makes an initial assumption about the functional form of the waveform under
consideration. This paper introduces the use of Bayesian model selection as a quantitative
method for selecting the waveform model. Using Bayes' theorem we show how the data
and prior information pick out the most plausible model from a set of proposed models.

Gravitational wave data analysis can be loosely described as a three step process 
as depicted in figure 1. In the first step, a signal is detected within a set of noisy 
time streams retrieved from the detector. In step two, the signal is characterized by 
producing estimates for the parameterization variables. Finally, step three is to make 
physical interpretations based on the estimated parameter values. These steps are not 
necessarily mutually exclusive. There are no obvious boundaries and areas of overlap 
do exist. However, each step is necessary when analyzing a detected signal. 

In making the transition from detection to characterization (and quite often in the
detection process itself) a particular waveform is assumed prior to the investigation. While
an obvious assumption to make in the early developmental stages for an algorithm, it can 
lead to needless complications and even misidentifications. For example, if a signal is 
characterized by a low signal-to-noise ratio, some of the intricate waveform features can 
be lost in the noise and therefore a simpler model would have sufficed in the analysis. 
In the Bayesian model selection approach presented here, the data and prior information
justify the selection of a particular waveform model by calculating the most plausible
model from a proposed library of models.

FIGURE 1. Data analysis flow chart. Detection: Is there a signal present in the data?
Characterization: How is the signal parameterized and what are the estimates for the
parameters? Interpretation: What new science have we gained from the data?

Bayesian model selection is not a new methodology, but it is one that has not been 
fully adopted by the still infant gravitational wave community. The aim of this paper is to 
briefly summarize the theory and to discuss possible applications for analyzing the LISA 
data. To this end, the paper first introduces the rules of probability theory, including a 
derivation of Bayes' theorem. It then outlines the necessary calculations for performing 
a model selection procedure. From here we give a simple, qualitative example of its use 
for the LISA data. We conclude by suggesting a few other applications associated with
LISA.

BAYESIAN STATISTICS

Rules of Probability Theory

We begin by introducing a notation first used by Jeffreys [1]. We will denote the
statement "the probability that proposition A is true given proposition B" as $P(A|B)$.
Similarly, "the joint probability that both A and B are true given C" is denoted by
$P(A,B|C)$. The notation "|C)" is the conditional that proposition C is assumed to be true.
In Bayesian statistics probability statements such as $P(A)$ are not clear because they do
not explicitly state their dependencies. Furthermore, all probabilities are conditional.

Starting with the desiderata that degrees of plausibility are represented by real numbers,
that the rules for manipulating plausibility statements should agree with common sense,
and that they should be consistent, it is possible to show that only two rules are
required for manipulating probabilities [2]: the Sum Rule,

$$P(A+B|C) = P(A|C) + P(B|C) - P(A,B|C), \tag{1}$$



where the plus sign inside the probability argument means "or", and the Product Rule, 



$$P(A,B|C) = P(A|C)\,P(B|A,C). \tag{2}$$

By standard Aristotelian logic it must be the case that $P(A,B|C) = P(B,A|C)$.
Consequently, the Product Rule may be re-expressed as

$$P(B,A|C) = P(B|C)\,P(A|B,C). \tag{3}$$

Equating the last two expressions results in Bayes' theorem,

$$P(A|B,C) = P(A|C)\,\frac{P(B|A,C)}{P(B|C)}. \tag{4}$$

Although Bayes' theorem receives the accolades, it is simply a consistency statement 
for the Product Rule. 

In words, Bayes' theorem is often stated as 

$$\text{Posterior} = \text{Prior} \times \frac{\text{Marginal Likelihood}}{\text{Global Likelihood}}.$$

In this form it is evident that Bayes' theorem quantitatively describes a learning process.
We start with a prior state of knowledge about proposition A when C is assumed true,
$P(A|C)$. We then gain new information B, which in turn updates our final state of
knowledge as given by the posterior probability, $P(A|B,C)$. The proportionality factor
between our prior and posterior states of knowledge is a normalized statement about
how likely proposition B is to occur given that both A and C are true.
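
The learning process described by equation (4) can be illustrated with a short numerical sketch. The function name and the numbers below are hypothetical and chosen only for illustration; the global likelihood is expanded over A and its negation, which assumes A is a simple true/false proposition.

```python
def posterior(prior_a, like_b_given_a, like_b_given_not_a):
    """Return P(A|B,C) from Bayes' theorem for a true/false proposition A."""
    # Global likelihood P(B|C): sum over the two exclusive cases A and not-A.
    global_like = prior_a * like_b_given_a + (1.0 - prior_a) * like_b_given_not_a
    # Posterior = prior * likelihood / global likelihood, as in equation (4).
    return prior_a * like_b_given_a / global_like


# Hypothetical inputs: P(A|C) = 0.5, P(B|A,C) = 0.9, P(B|not A,C) = 0.3.
p_post = posterior(0.5, 0.9, 0.3)
print(p_post)
```

With these illustrative numbers the new information B raises the plausibility of A above its prior value, because B is more likely when A is true than when it is false.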

While Bayes' theorem is a useful byproduct of the Product Rule, the use of the Sum 
Rule is equally important. It is through the Sum Rule that we are able to take a joint 
probability of multiple propositions, and reduce it to a distribution of a smaller subset of 
the larger joint distribution. For example, consider the joint distribution between $A$ and
a set of $n$ exhaustive $B_i$'s, given prior information $I$. From the Sum Rule we have

$$\begin{aligned}
P\Big(A, \sum_{i=1}^{n} B_i \Big| I\Big) &= P(A|I) \\
&= P(A,B_1|I) + P\Big(A, \sum_{i=2}^{n} B_i \Big| I\Big) - P\Big(A, B_1, \sum_{i=2}^{n} B_i \Big| I\Big),
\end{aligned} \tag{5}$$

where the first equality follows from the Product Rule and the fact that the $B_i$'s are
exhaustive. If the $B_i$'s are mutually exclusive, that is, only one value can be realized at a
time, then the last term is zero. Repeated applications of the Sum Rule lead to

$$P(A|I) = \sum_{i=1}^{n} P(A,B_i|I). \tag{6}$$

When the $B_i$'s take on continuous values the above goes over to an integral,

$$P(A|I) = \int p(A,B|I)\,dB. \tag{7}$$



The process which we have just described is referred to as marginalization. In it we have 
removed a nuisance parameter, B, from a joint distribution by a repeated application of 
the Sum Rule. 
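
As a minimal sketch of marginalization in the discrete case of equation (6), the snippet below sums a joint distribution over a nuisance index. The joint probabilities and the helper name `marginalize_b` are hypothetical, and the three values of B are assumed to be exhaustive and mutually exclusive.

```python
# Hypothetical joint distribution P(A, B_i | I): keys are (value of A, index i).
joint = {
    (True, 1): 0.10, (True, 2): 0.25, (True, 3): 0.05,
    (False, 1): 0.20, (False, 2): 0.15, (False, 3): 0.25,
}


def marginalize_b(joint_table, a_value):
    """Sum the joint distribution over all B_i for a fixed value of A."""
    return sum(p for (a, _), p in joint_table.items() if a == a_value)


p_a = marginalize_b(joint, True)       # P(A | I)
p_not_a = marginalize_b(joint, False)  # P(not A | I)
print(p_a, p_not_a)
```

The nuisance index no longer appears in the result, and the two marginal probabilities still sum to one.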



Model Selection 

In model selection the central question that is being addressed is the following: 
"Given a particular set of data, and prior information, which hypothesis from a library 
$\mathcal{L} = \{H_1, \ldots, H_\ell\}$ of hypotheses is the most plausible?" Key to this question are the
ideas that all prior information is included and that the most plausible hypothesis is based 
on the given data. The hypotheses within a library are either assumed to be exhaustive 
or, by a careful choice in models, the space is made so [3]. 

A model itself consists of a functional form dependent on a vector of parameters
$\vec{\lambda}$, and two probability distributions [4]. The first distribution describes the probability
distribution for the parameter values given the model prior to the new data, $P(\vec{\lambda}|H_\alpha)$.
This is a key point; two models are distinct even if they have the same parameterization
but different priors about how those parameters are believed to be distributed. The
second distribution is the probability of a data set given the model and a particular set of
parameter values, $P(D|\vec{\lambda},H_\alpha)$.

From Bayes' theorem (4), the posterior probability for a particular model is given by 

$$P(H_\alpha|D,I) = P(H_\alpha|I)\,\frac{P(D|H_\alpha,I)}{P(D|I)}, \tag{8}$$

where $I$ symbolizes our unenumerated prior information. The denominator can be
viewed as a normalization constant,

$$P(D|I) = \sum_{\alpha=1}^{\ell} P(H_\alpha|I)\,P(D|H_\alpha,I). \tag{9}$$

By investigating the odds ratio between two competing models, we can eliminate the 
need to calculate the normalization constant, 
$$\begin{aligned}
\frac{P(H_1|D,I)}{P(H_2|D,I)} &= \frac{P(H_1|I)\,P(D|H_1,I)}{P(H_2|I)\,P(D|H_2,I)} \\
&= \frac{P(D|H_1,I)}{P(D|H_2,I)}.
\end{aligned} \tag{10}$$

The second line arises by assuming that our prior information does not favor one model 
over the other. The odds ratio gives us a means to directly compare competing models. 
If our library contains more than two models, one model may be used as a reference. For 
example, the reference model may be a constant (i.e. a no signal present model), while 
the remaining library contains a spectrum of waveform models. 
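
The use of a reference model can be sketched as follows. The model names and log marginal likelihood values are hypothetical placeholders, not results from any analysis; logarithms are used only to avoid numerical underflow, and equal model priors are assumed unless others are supplied.

```python
import math


def odds_ratio(log_evidence, name, reference):
    """Odds ratio of `name` against `reference`, assuming equal model priors."""
    return math.exp(log_evidence[name] - log_evidence[reference])


def posterior_probabilities(log_evidence, log_prior=None):
    """Normalized posterior probability of each model in the library."""
    if log_prior is None:
        log_prior = {name: 0.0 for name in log_evidence}  # equal priors
    log_w = {name: log_evidence[name] + log_prior[name] for name in log_evidence}
    shift = max(log_w.values())  # subtract the maximum before exponentiating
    norm = sum(math.exp(v - shift) for v in log_w.values())
    return {name: math.exp(v - shift) / norm for name, v in log_w.items()}


# Hypothetical log marginal likelihoods ln P(D|H,I) for a three-model library.
log_evidence = {"no_signal": -105.0, "monochromatic": -100.0, "chirping": -101.0}

odds = odds_ratio(log_evidence, "monochromatic", "no_signal")
probs = posterior_probabilities(log_evidence)
print(odds, probs)
```

Only differences of log marginal likelihoods enter the odds ratio, so the normalization constant of equation (9) is needed only when absolute posterior probabilities are requested.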

From the odds ratio it is apparent that to compare models in a library only their
marginal likelihoods need to be calculated. The likelihoods are found by marginalizing,
over all model parameters, the joint distribution for the data and the model parameters,

$$P(D|H_\alpha,I) = \int P(D,\vec{\lambda}_\alpha|H_\alpha,I)\,d\vec{\lambda}_\alpha = \int P(\vec{\lambda}_\alpha|H_\alpha,I)\,P(D|\vec{\lambda}_\alpha,H_\alpha,I)\,d\vec{\lambda}_\alpha, \tag{11}$$

where the second equality follows from the Product Rule.

FIGURE 2. A pictorial representation for the origins of Occam factors in Bayesian model comparisons.

If the data is informative, i.e. we have learned something new, then the parameter
likelihood function, $P(D|\lambda_\alpha,H_\alpha,I)$, will be more peaked than the parameter prior,
$P(\lambda_\alpha|H_\alpha,I)$. Figure 2 illustrates this for a one dimensional model. In this instance we
can estimate the marginal likelihood as

$$P(D|H_\alpha,I) \approx P(D|\lambda_{ML},H_\alpha,I)\,\left[P(\lambda_{ML}|H_\alpha,I)\,\delta\lambda\right]. \tag{12}$$

Here $\lambda_{ML}$ is the parameter value at the maximum likelihood and $\delta\lambda$ is the characteristic
width of the parameter likelihood function. The term in square brackets is an Occam
factor, a term that naturally penalizes complicated models. To see this consider a uniform
prior, $P(\lambda|I) = (\Delta\lambda)^{-1}$, where $\Delta\lambda$ is the interval width for the range of expected
parameter values before the data is collected. The marginal likelihood is now

$$P(D|H_\alpha,I) \approx P(D|\lambda_{ML},H_\alpha,I)\,\frac{\delta\lambda}{\Delta\lambda}. \tag{13}$$

For informative data the Occam factor is always less than unity. Consequently, for a
complicated model to be favored over a simpler one, the data must justify it by having a
correspondingly larger value for the parameter likelihood function.
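
The estimate in equation (13) can be checked numerically under a few stated assumptions: a Gaussian-shaped likelihood in a single parameter, a uniform prior, and a likelihood peak well inside the prior range. For a Gaussian of standard deviation sigma the characteristic width is taken here to be sqrt(2 pi) sigma. All numerical values are hypothetical.

```python
import math


def marginal_likelihood(like_max, lam_ml, sigma, lam_lo, lam_hi, n=20000):
    """Midpoint-rule integral of prior * likelihood with a uniform prior."""
    prior = 1.0 / (lam_hi - lam_lo)  # uniform prior density, 1 / Delta lambda
    step = (lam_hi - lam_lo) / n
    total = 0.0
    for k in range(n):
        lam = lam_lo + (k + 0.5) * step
        # Assumed Gaussian-shaped likelihood peaked at lam_ml.
        total += like_max * math.exp(-0.5 * ((lam - lam_ml) / sigma) ** 2)
    return prior * total * step


# Hypothetical values: narrow likelihood (sigma = 0.01) inside a prior range of width 1.
like_max, lam_ml, sigma = 1.0, 0.3, 0.01
lam_lo, lam_hi = 0.0, 1.0

exact = marginal_likelihood(like_max, lam_ml, sigma, lam_lo, lam_hi)
delta_lam = math.sqrt(2.0 * math.pi) * sigma  # characteristic likelihood width
occam = delta_lam / (lam_hi - lam_lo)         # Occam factor of equation (13)
approx = like_max * occam
print(exact, approx, occam)
```

Under these assumptions the direct integral and the Occam-factor estimate agree closely, and the Occam factor is well below unity because the likelihood is much narrower than the prior range.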

The preceding argument is quickly extended to multiple dimensions. If the model has
more than one parameter, then there is a corresponding Occam factor for each parameter,

$$P(D|H_\alpha,I) \approx P(D|\vec{\lambda}_{ML},H_\alpha,I)\,\frac{\delta\lambda_1}{\Delta\lambda_1} \cdots \frac{\delta\lambda_i}{\Delta\lambda_i}, \tag{14}$$

where $i$ is the number of parameters.
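
A short sketch of how the product of Occam factors enters a comparison between a simpler and a more complicated model follows. The function names, maximum log-likelihood values, and width ratios are hypothetical, and logarithms are used so that a product of many small factors does not underflow.

```python
import math


def log_occam_factor(post_widths, prior_widths):
    """Sum of ln(delta lambda_k / Delta lambda_k) over all parameters."""
    if len(post_widths) != len(prior_widths):
        raise ValueError("one prior width is needed per parameter")
    return sum(math.log(d / big_d) for d, big_d in zip(post_widths, prior_widths))


def log_marginal_likelihood(log_like_max, post_widths, prior_widths):
    """Logarithm of the estimate in equation (14)."""
    return log_like_max + log_occam_factor(post_widths, prior_widths)


# Hypothetical case: the four-parameter model fits slightly better (higher
# maximum log-likelihood) than the two-parameter model.
log_z_simple = log_marginal_likelihood(-50.0, [0.1, 0.1], [1.0, 1.0])
log_z_complex = log_marginal_likelihood(-48.0, [0.1] * 4, [1.0] * 4)

log_odds = log_z_simple - log_z_complex  # positive values favor the simple model
print(log_odds)
```

In this illustrative case the better fit of the four-parameter model does not compensate for its two additional Occam factors, so the simpler model is favored.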



As a last point of emphasis, it is not enough to perform a parameter estimation analysis
and find that $\lambda_i = 0$, therefore ruling out the model that includes $\lambda_i$. Doing so would
neglect the Occam factors that arise in Bayesian model selection and are not present in
a parameter estimation analysis, even a Bayesian analysis.



WHITE DWARF TRANSFORM

As a conceptually trivial but applicable example of Bayesian model selection for the
LISA mission, consider the detection of a supermassive black hole binary inspiral. For
black hole binaries with component masses in the range of $10^4$ to $10^7\,M_\odot$, LISA will observe
the binary evolution as the binary sweeps through frequencies from ~0.01 mHz up to
a few millihertz (depending on the actual masses). In this same range of frequencies
is the gravitational wave background formed from the $\sim 10^8$ solar mass binaries in
our own galaxy. As the black holes inspiral, their detected signal will overlap with the
collective galactic background signal. Moreover, at any instant of time the black hole
binary looks like a monochromatic binary. That is, as a supermassive black hole binary
with a time to coalescence of $t_c$ sweeps past a galactic binary of period $T$, the two signals
have a significant overlap for an interval equal to the geometric mean of $t_c$ and $T$ [5].
Consequently the black hole inspiral signal may be decomposed into a population of
monochromatic galactic binaries. Such a process is often referred to as a white dwarf
transform.

For a gravitational wave data analyst the task is to select which of two models is more
plausible. The models under consideration are

    $H_{wd}$: the detected signal is from a population of monochromatic galactic binaries

    $H_{bh}$: the detected signal is from a single supermassive black hole binary

Model $H_{wd}$ is parameterized by $7N$ variables, where $N$ is the number of binaries
required to describe the apparent inspiral signal. For an inspiral signal between 0.01 and
1 mHz, $N$ is on the order of $10^4$ assuming a binary per frequency bin and a one year
observation (a frequency bin $\Delta f$ is equal to one over the observation time, so for the
one year observation used here $\Delta f = 3.2 \times 10^{-8}$ Hz). Conversely, model $H_{bh}$ is
characterized by only seventeen parameters.

Estimating the posterior probabilities using equation (14) quickly leads to the conclusion
that the large parameter space of the white dwarf population model carries with it an
overwhelming number of Occam factors. These Occam factors penalize the white dwarf
population model and in turn make the plausibility of the model extremely low. The
black hole model, on the other hand, has only seventeen Occam factors and therefore is
not as severely penalized. Consequently, although an ensemble of galactic binaries could
conspire to look like a supermassive black hole binary inspiral, the relative probability
for such a model is many orders of magnitude less than that of a model that contains a
single black hole binary.
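
The counting argument above can be made concrete with a rough sketch. Several inputs are assumptions introduced only for illustration: seven parameters per monochromatic binary, one binary per frequency bin, and a hypothetical Occam factor of 0.1 for every parameter in both models. Only the seventeen black hole parameters, the frequency range, and the one year observation come from the text.

```python
import math

SECONDS_PER_YEAR = 3.156e7  # approximate length of one year in seconds


def frequency_bins(f_lo_hz, f_hi_hz, t_obs_s):
    """Number of frequency bins of width 1 / t_obs between f_lo and f_hi."""
    delta_f = 1.0 / t_obs_s
    return int((f_hi_hz - f_lo_hz) / delta_f), delta_f


# Inspiral signal between 0.01 mHz and 1 mHz, one year of observation.
n_bins, delta_f = frequency_bins(1.0e-5, 1.0e-3, SECONDS_PER_YEAR)

PARAMS_PER_BINARY = 7   # assumption: parameters per monochromatic binary
PARAMS_BLACK_HOLE = 17  # parameter count quoted in the text
OCCAM_PER_PARAM = 0.1   # hypothetical Occam factor for every parameter

# Base-10 logarithm of the product of Occam factors for each model.
log10_penalty_wd = PARAMS_PER_BINARY * n_bins * math.log10(OCCAM_PER_PARAM)
log10_penalty_bh = PARAMS_BLACK_HOLE * math.log10(OCCAM_PER_PARAM)
print(n_bins, delta_f, log10_penalty_wd, log10_penalty_bh)
```

Even with this mild per-parameter factor, the penalty on the white dwarf population model exceeds that on the black hole model by a very large number of orders of magnitude, which is the qualitative point of the example.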



CONCLUDING REMARKS 

The white dwarf transform is an obvious application of Bayesian model selection. More
informative and interesting examples include using Bayesian model selection as a criterion
for deciding when a signal is present in the data; characterizing complicated but
detected signals that have low signal-to-noise ratios; and counting the number of detectable
galactic binaries within the larger population. The first application is simply
answering the question, when does the data justify declaring a detection for a particular
waveform? The second application is concerned with determining the information content
of a weak signal. That is, what features of an emitting system are actually measurable
and what features are lost to the noise? Counting the number of detectable galactic
binaries is one of the few Bayesian model selection applications used in the LISA literature
[6, 7]. Embedded within Reversible Jump Markov Chain Monte Carlo techniques
is the use of odds ratios in deciding the number of galactic binaries that are detectable.

In general, Bayesian model selection gives a logical and quantitative approach to 
directly comparing competing models. By using a model selection procedure we are 
able to maximize the amount of information we can extract from LISA's data. The
most plausible model is the one that is most justified by the data and our state
of knowledge prior to the experiment. As progress is made in the development of LISA
analysis routines it is conceivable that Bayesian approaches will be a central tool. 